SARS-CoV-2 protein structure and sequence mutations: Evolutionary analysis and effects on virus variants

The structure and sequence of proteins strongly influence their biological functions. New models and algorithms can help researchers in understanding how the evolution of sequences and structures is related to changes in functions. Recently, studies of SARS-CoV-2 Spike (S) protein structures have been performed to predict binding receptors and infection activity in COVID-19, hence the scientific interest in the effects of virus mutations due to sequence, structure and vaccination arises. However, there is the need for models and tools to study the links between the evolution of S protein sequence, structure and functions, and virus transmissibility and the effects of vaccination. As studies on S protein have been generated a large amount of relevant information, we propose in this work to use Protein Contact Networks (PCNs) to relate protein structures with biological properties by means of network topology properties. Topological properties are used to compare the structural changes with sequence changes. We find that both node centrality and community extraction analysis can be used to relate protein stability and functionality with sequence mutations. Starting from this we compare structural evolution to sequence changes and study mutations from a temporal perspective focusing on virus variants. Finally by applying our model to the Omicron variant we report a timeline correlation between Omicron and the vaccination campaign.


Introduction
Comprehension of cellular processes requires studying relations between the sequence of genes and the structure of encoded proteins [1,2] through genomic and proteomic studies. From an evolutionary point of view, changes in gene sequences (e.g. single nucleotide mutations) may imply a modification of protein structure and, thus, phenotype changes. The evolutionary process limits some phenotype modifications due to environmental constraints [1,3].
The pandemic of the SARS-CoV-2 virus has led to the storage of a large amount of genomic and proteomic datasets enabling the molecular evolutionary analysis of proteins thereby boosting studies of the relations between the protein sequence structure and functions of the virus [4,5]. The genome of SARS-CoV-2 contains 29.9 kilobase [6], and has 14 functional open reading frames (ORFs) and multiple encoding regions: (i) four structural proteins (i.e., Spike, S; envelope, E; membrane, M; and nucleocapsid, N); (ii) 16 nonstructural proteins (nsp1-nsp16) and (iii) accessory proteins [5,7,8].
Viruses undergo many mutations during the process of replication [9,10]. Mutations can occur randomly due to errors in replication steps and to changes in the structure of the viral proteins [11,12]. These mutations may acquire an evolutionary advantage when they improve their ability to escape from the host's immune system [13]. Many studies have identified mutations in SARS-CoV-2 [5,[14][15][16][17], and in the relation between virus mutations and the ability to override immune systems (e.g., relating the evolutionary process with the spread of the virus).
Sequence of mutations may cause the insurgence of variants. A variant is a viral genome with one or more mutations causing changes in protein structure and virus characteristics. Numerous studies have focused on the evolution of SARS-CoV-2 main variants, highlighting the increase of transmissibility vis-a-vis a reduction in severity, (e.g. Omicron variant [17]). The World Health Organisation (WHO) and the European Center for Disease Control (ECDC) are tasked with studying new evidence on variants. Sequenced samples of SARS--CoV-2 gathered from around the world are analysed by the WHO and ECDC for publishing information about variants and health protocols. Variants and available information are published [18][19][20] and in the ECDC/WHO register available on line. Variants are organised into lineages, groups of closely related viruses with a common ancestor. Although mutations occur very frequently during virus replication, only few modifications change the virus functionalities. A group of variants with similar genetic changes has been designed by the WHO as a Variant Being Monitored (VBM), Variant of Concern (VOC), or as a Variant of Interest (VOI) due to shared attributes and characteristics that may require government action to safeguard public health [21,22].
We focus on a subset of variants, responsible for S protein structure modifications, by analysing structural models gathered from the Exascalate4Cov consortium database [23]. We study the impact of sequence changes on Spike protein structure by means of the Protein Contact Network (PCN) formalism [24][25][26][27][28]. PCNs are graphs whose nodes represent the C − α atoms of the backbone of proteins, while edges represent a relative spatial distance of 4-8 Ångstroms among residues. Topological descriptors of PCNs, such as node centrality measures, are used to discover protein properties, even at the sub-molecular level [26,27]. We focus on the correlation between sequence and structure evolution and figure out relations between sequence updates, structure and phenotype variations. A similar model has been used to study Spike proteins of variants of SARS-CoV-2. We investigate how sequence changes generate relevant changes in the structure. We calculate centrality measures for each PCN node and their changes due to mutations. We also measure the structural differences among variants using the template modelling score (TM-SCORE) [29], a metric for assessing the topological similarity of protein structures. Moreover, using the Louvain algorithm we extract communities for each PCN considered [30] and we then relate them to mutations. Finally we build two trees, one considering the difference in network descriptors and the second regarding the distance in terms of TM-SCORE to clarify possible divergences between the timeline evolution of sequences and structure. We analysed each variant from three perspectives: the sequence, the structure, and the PCN network parameters. The obtained results allowed us to put forward the following claims: • The temporal evolution of the variants shows that both sequence and structure of Spike proteins changed significantly after the beginning of the large-scale vaccination campaign.
• The PCN analysis shows local changes in the protein structure of the studied variants, which can be related to protein folding and stability.
• PCN nodes centrality measures highlight the differences between Spike proteins in: (i) Omicron 1 variant versus Wild Type, and (ii) Delta variant versus Wild Type.
• Communities extracted with the Louvain algorithm represent correlations among amino acids which correspond to different functional domains of spike proteins, for example, with the Louvain algorithm communities can be mapped into the SARS-CoV2 Omicron 1 Spike variant structure.
• For S proteins, the net charge of RBD and NTD domain of the S protein, predicted from pK a values, increases with time.
In this paper we focus on: (i) the correlation between sequence changes and structural changes in the evolving Spike proteins of the SARS-CoV-2 variants; (ii) correlation between vaccination campaigns and Omicron variants mutations. Similar works [17,[31][32][33] covered the same problem: for example [17] focused on the impact of mutations on the virus pathogenicity, while [33] focused on the impact on binding affinity. In this paper, we also analyse parameters extracted from both sequence and structure, integrating different viewpoints. Unlike the works which consider different scales [32], this paper also analyses the impact of mutations at the intermediate level by using the PCN model [31] and considers a larger set of structures and parameters.

Materials and methods
We here propose the analysis of sequence mutations and structural changes using the pipeline described in Fig 1. The structures of the Spike protein of some selected virus variants are used as input to the pipeline. PDB files of the three-dimensional Spike structures are gathered from Exascalate4Cov consortium database [23]. Table 1 reports the variants used as input in the pipeline and the related mutations on Spike protein. For each Spike protein we compute a corresponding PCN using the PCN-Miner tool [31]. Each node of the PCN corresponds to a single amino acid of the corresponding protein, while an edge connects two nodes whose spatial distance is comprised of between 4 and 8 Angstroms [26]. For each node of the PCN a set of centrality measures is evaluated. We used the following: Betweenness, Degree, Eigenvector, Closeness, and Katz, centrality measures. A description and definition are reported in Table 2.

Measure Description Measure Definition
Betweenness Centrality measure: given a node i, it measures how much the node (i.e., amino acid) influences communication and serves as a bridge from one part of the graph (i.e., part of the represented protein) to another. σ j,k indicates the number of shortest paths from node j to node k and σ j,k (i) indicates the shortest path which includes i.
Degree Centrality measure: given a node i, it measures the normalised degree of i, i.e. the number of connections of the node. Nodes (amino acids) with a high centrality, considered hubs in the network (i.e., the protein), have a crucial role in the network communication. The degree centrality of a node i is computed as reported on the right.

C deg ðiÞ ¼ degðiÞ maxðdegreesÞ
Eigenvector Centrality measure: given a node i, the Eigenvector centrality measure how a node is connected to other central nodes. A high value means that a node i is connected to many high-score nodes. Let A be the adjacency matrix of a graph G. The eigenvector centrality C eig iof a node i is given by the formula C eig (i)= 1 is the set of neighbours and λ is a constant.
Closeness Centrality measure: given a node i, it measures the distance (closeness) among i (amino acid) to all graph nodes and it is evaluated as reported on the right, where d(i, j) is the shortest distance between i and j.

PCN analysis
We measure the centrality values of all the PCN nodes and compare both the overall changes (i.e., averaging the centrality values) and local changes (i.e., changes in the centrality values of mutated residues). We depict these values by using boxplots and radar plots. Boxplots are associated with all variants. Radar plots are used to represent centrality values of mutated nodes in the Spike variants. Finally, the obtained centrality measures are mapped onto the real protein structure using PCN-Miner. A t-test [34] on the variants centrality distribution is used to evaluate the significance of any changes. Community detection analysis has been performed with the Louvain algorithm with default parameters [30] to study the relation between virus mutations and communities in PCNs. The Louvain method allows us to find communities in graphs. Based on the greedy paradigm, which uses modularity graph information to optimise performance on large graphs, it works by finding small communities, each mapped into one node of a new graph. The process is repeated until all of the nodes in each small community have been covered. In our scenario, communities represent a set of related amino acid. After the communities have been extracted, we map them into functional regions, i.e., the RBD domain, and analyse the presence of mutations in such communities. The Louvain algorithm has been widely used in network analysis [35,36] to discover communities. We apply it here to PCNs to find communities overlapping with protein structures and to relate communities with protein functionalities. For each variant, we plot its mutations inside the communities to graphically show where most mutations end up. This allows us to identify a pattern in mutation distributions of the Louvain communities. In the case of a high percentage of mutations belonging to the same community, we can claim that the corresponding mutations share similar effects on protein functionality.

Structural analysis
We computed TM-scores [37] between pairs of Spike proteins of two different variants by using the US-align (Universal Structural alignment) software [38]. The TM-score quantify the structural similarity between proteins (see the upper part of Fig 1). Thus, the pipeline plots sequence distance and structural distance for Spike variants. We also evaluate contact similarity between PCNs variants. This is defined as the percentage of contacts/non-contacts shared between two PCNs, and should represent a measure of similar behaviours among subnetworks. Finally, for each variant, we computed the acid dissociation constant pK a for each amino acid of the analysed proteins using the PROPKA3 web server [39]. Given a node, pK a value is the −log 10 K a , where K a is the acid dissociation constant that measures amino acid acidity or basicity. Following the method proposed in [32], the pKa values were used to predict the overall domain charge, also known as EPrbd and EPntd, respectively, for RBD and NTD protein domains. The surface electrostatic potential has been calculated for each variant RBD and NTD by the APBS (Adaptive Poisson-Boltzmann Solver) software with default parameters [40].

Sequence analysis
We performed sequence alignment using the CLUSTALW software with default parameters [41]. Implementation. The methods described above were developed with the help of the Jupy-terLab environment [42]. The code was written in Python language version 3, plotly and matplotlib libraries (used to draw figures) [43], and PROPKA [39] library to calculate pK a of the proteins. The PCN-Miner [31] tool computed PCNs, node centralities, and Louvain algorithms to extract communities [30] for the Spike variants. PyMOL library [44] used to read, visualize and modify PDB files. The CUPSAT software [45] is used to predict changes in protein stability caused by a particular mutation of the Wild Type form. Finally, we used US Align [38] to perform a sequence-independent alignment based on the structural similarity of the variants and to compute the TM score.

Results
We analysed the sequence and structure of Spike proteins of fourteen selected SARS-CoV-2 variants using PDB files as input. For each PDB file we calculated a PCN. We then evaluate centrality measures (Betweennes, Degree, Eigenvector, Closeness, and Katz) for each PCN. Eigenvector centrality values have been calculated to show how protein structure varies. In particular, we note that when centrality values have little variation among variants, the global form of protein structure does not vary among variants, while local changes occur. All the evaluated centrality measures for variants are reported as boxplots in Fig 3. Boxplots evidence that the changes in Eigenvector Centrality are not significant considering both all nodes and only those corresponding to the mutated residues. Conversely, rewiring of the structure causes more changes in Katz Centrality values than other measures. Further measures confirmed the above results that can be found in Tables 3 and 4. Similarly to [24], the obtained eigenvector centrality evidence protein instability. In particular, the mutations of Omicron 1 and Delta variants cause Spike protein instability, according to [14]. Is also woth nothing that for the Wild Type virus, amino acids of the RBD domain Changes in centrality measures have been compared by using a t-test. A p-value less than 0,05 represent a significant change in average centrality values. Table 3 reports a comparison between Omicron 1 and all the other structures considering only mutated sites. Table 4 reports the same comparison for all the residues and Table 5 reports confidence intervals of average centrality measure values.
By using the here proposed pipeline we found a significant change of Katz centrality measures between Omicron 1 variant and all the other Spike variants (also considering the Wild Type one). Significant changes in betweenness centrality and degree centrality values have also been reported, where the significance is measured by using t-test analyses. Complete comparisons, code and numerical results can be found in references reported in Supporting information Section.
As reported in Fig 1, after the centrality analysis we analysed the node communities in each PCN were analysed by using the Louvain algorithm, as reported in  Table 6 summarises the extracted communities using the Louvain algorithm.
Protein structures were analyzed by means of TM-scores, structural distance, and PCN Contact similarity between Spike variants. Sequence distances between variants were used to construct a phylogenetic tree. Fig 8 reports the comparison of these measures. The figure shows that there is no direct correlation between sequence changes and structural modification. For instance, Iota1, and Zeta variants have dissimilar sequences but highly similar structures.
For each variant, we computed the net charge of RBD and NTD domain. In Fig 9 we show the RBD and NTD net charges of all the selected variants. The values are reported in Table 7.  Tables 3 and 4 show that changes in centrality values could signify protein instability. Considering Omicron 1 and Delta variants in particular (see Fig 4) the potential instability is due to the following specific mutations. E.g., the Omicron 1 variant had multiple mutations in the RBD domain, most of which associated with protein instability. The N501Y mutation is the only one this increased stability. Consequently, the Omicron 1 PCN presents an eigenvector centrality value that decreases for the RBD domain. The significance of changes in centrality measures was assessed by t-test analyses, comparing average values across nodes and nodes corresponding to mutations. Variations in centrality measures, including Katz centrality, between Omicron 1 and other variants (including the Wild Type) were found to be significant. Changes in betweenness centrality and degree centrality values were also reported. Community detection analysis using the Louvain algorithm identified communities within PCNs, which often correspond to functional domains of the Spike protein. Mutations occurring within the same community appeared to have similar   Fig 8, there is no correlation between sequence and structure similarity [28].

Results in
The acid dissociation constant (pKa) used to predict the net charge showed that protein Spike acquires a positive charge with time as shown in Fig 9. Valuable insights into the centrality and community structure of Spike protein variants were gained, suggesting potential protein instability, identifying mutation clusters, and assessing structural characteristics. Results contribute to understanding the behaviours and potential implications of different SARS-CoV-2 variants. Moreover, they have several important  biological implications. Topological analysis of protein contact networks can provide insight into the stability and function of proteins. The results regarding changes in centrality measures indicate protein instability, with the Omicron 1 variant showing decreased RBD eigenvector centrality due to mutations associated with protein instability. This information may be further analysed to shed light on the effects of viral mutations on transmissibility and changes in disease severity. This could be considered the first step towards the prediction of novel mutations as well as support for providing therapies and vaccines that target specific regions of the virus. Moreover, the community detection analysis revealed that mutations falling in the same community often have similar effects on protein structure and hence function. This may help to predict the effects of mutations on the virus's behaviour. We have shown that many changes occurred after the beginning of the vaccination campaign. These changes include modifications of protein structure as evidenced through TMscores, structural distance, and PCN contact similarity. This may help to explain how virus  mutations affect viral transmission, replication, or immune evasion. This information can be used to guide the design of drugs and vaccines that target specific regions of the virus and can help to predict the spread and evolution of new variants. We explore patterns of changes in a temporal dimension and compare the cumulative distribution of vaccination with the characteristics of the variant. Although we cannot infer any causality regarding vaccination driving the evolution, we should note that the presence of vaccinations in a timeline is located in the middle of the first variants of SARS-CoV-2 and Omicron. Considering also the clinical characteristics of Omicron in terms of vaccine escape and neutralization of immune response, we can assume that the effect of all Omicron changes may be related to the structural changes also revealed by the above-reported measures. Visualization of the mutations and the communities on the protein structure. Mutated residues are displayed as red spheres. In Omicron 1 , and Omicron 5 , the mutations seem to fall inside certain communities with the same function in the Spike Protein. We found that Omicron 1 has more than ten mutations that fall inside the same community.
https://doi.org/10.1371/journal.pone.0283400.g007 Table 6. Summary of communities and their mutations. As some mutations do not belong to any community, we report them here as belonging to a virtual community referred to as (UnModelled).

SARS-CoV-2 evolutionary analysis
Finally, our results show that Omicron variants clearly differ from the others. All the considered parameters confirm that there is a remarkable change in centrality values. In particular, the difference in terms of network invariants among Alpha, Beta, Delta, and Omicron has been studied also in [28]. The presented results here can be considered as a further extension of our previous in two ways. First, we confirm that the difference is not limited to network invariants, but also other protein structural measures, i.e. polarity and TM-Score confirm that Omicron is radically different from the others. We also note an increase of net charge of RBD domain over time [32] which may explain the increased transmissibility through a better binding affinity with the ACE2 receptor.
Finally, as we also show in Fig 10, Omicron 1 variants appeared after the start of the vaccination campaign. Even though this does not imply any causal relation [46,47], this relation may require additional studies to shed light on the relation between the vaccination campaign and the evolution of the SARS-CoV-2 virus.

Study limitations
While the study provides valuable insights into the centrality and community structure of Spike protein variants, there are some limitations to consider: 1. Data Availability: The study is based on the availability of protein structure encoded into PDB files, which might have limitations in terms of availability, completeness, and accuracy. 2. Simplified Representation: The analysis considers only protein structures and network structures while other factors such as protein dynamics, ligand interactions, and post-translational modifications might provide more insights.
3. Lack of Experimental Validation: As the findings are based on in-silico, experimental validation is needed to verify structural changes.
4. Limited Sample Size: As the analysis focuses on a specific set of Spike protein variants, the inclusion of a larger set of variants could corroborate the results.
Further research addressing these limitations could contribute to a fuller understanding of the implications of Spike protein variants on protein structure and function.

Conclusion
In this work we presented a pipeline for analysing mutations of the SARS-CoV-2 genome, focusing on the impact of the mutations on Spike protein. From a timeline perspective, we observed that the Omicron variant presents significant changes with respect to the previous ones and Omicron appeared in parallel with the worldwide vaccination campaign. Omicron's structure presents many changes in the RBD domain and many mutations fall within the same structural region. A final remark, therefore we would argue that further studies should focus on the possible relationship between the vaccination campaign and the of Omicron appearance.